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Abstract 

Background: Microarray-based comparative genomic liybridization (aCGH) is used for rapid comparison of genomes 
of different bacterial strains. The purpose is to evaluate the distribution of genes from sequenced bacterial strains 
(control) among unsequenced strains (test). We previously compared the use of single strain versus multiple strain 
control with arrays covering multiple genomes. The conclusion was that a multiple strain control promoted a better 
separation of signals between present and absent genes. 

Findings: We now extend our previous study by applying the Expectation-Maximization (EM) algorithm to fit a mixture 
model to the signal distribution in order to classif/ each gene as present or absent and by comparing different methods 
for analyzing aCGH data, using combinations of different control strain choices, two different statistical mixture models, 
with or without normalization, with or without logarithm transformation and with test-over-control or inverse signal 
ratio calculation. We also assessed the impact of replication on classification accuracy. Higher values of accuracy have 
been achieved using the ratio of control-over-test intensities, without logarithmic transformation and with a strain mix 
control. Normalization and the type of mixture model fitted by the EM algorithm did not have a significant impact on 
classification accuracy. Similarly, using the average of replicate arrays to perform the classification does not 
significantly improve the results. 

Conclusions: Our work provides a guiding benchmark comparison of alternative methods to analyze aCGH results 
that can impact on the analysis of currently ongoing comparative genomic projects or in the re-analysis of published 
studies. 
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Findings 

Background and purpose 

This manuscript is an update of a previous work [1] on 
the analysis of microarray based comparative genomic 
hybridization (aCGH) using multistrain arrays. aCGH is 
a tool used in the investigation of the genetic content of 
closely related microorganisms and screening for viru- 
lence factors [2,3]. 

The idea behind this technology is to generate micro- 
arrays from sequenced genomes, and then hybridize 
genomic DNA from other sources to these arrays. The 
aim is to detect similarities and differences in genomic 
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content through the characterization of test and control 
samples given the genes common to both samples and 
those that are specific to the control sample [2]. 

In two color aCGH experiments, the control sample 
is composed of DNA from one or more strains with se- 
quenced genomes represented in the array. The test strain 
is composed of DNA from a non-sequenced strain. Each 
sample is labeled with a different fluorochrome and 
hybridized to the microarray [1]. 

The analysis of aCGH experiments aims to classify the 
genes as present or absent. Many approaches have been 
applied [2,4-7], but most of these methods base their 
results on the logarithm of the intensity ratio (log-ratio 
or LR = log2 (T/C)) of both fluorescent signals (test (T) 
and control (C)). Genes present in both genomes will give 
a clear fluorescence signal in both channels (LR close to 
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0), and absent genes in the test sample are expected only 
to have a signal in the control channel (negative LR). 

Concerning the control used when arrays are designed 
with multiple sequenced genomes, there is no consensus 
about the best option. Some studies have used a single 
strain control [8-11] while others adopted a mixed con- 
trol approach [6,12]. Some of the studies argue that 
a multi-strain mix control, while generating a control 
signal for every spot on the microarray, leads to an in- 
creased signal intensity of core genome spots (that rep- 
resent genes present in all strains) in the control channel 
and can complicate data analysis based on ratios [9,11]. 
In order to clarify these approaches, we previously per- 
formed a comparative study with a multi-strain array for 
Streptococcus pneumoniae [1], covering the genomes of 
three sequenced strains: TIGR4 (T4), R6 (R6) and G54 
(G54). We used two different controls: a single strain 
(T4) and an equimolar mix of the three strains repre- 
sented in the array (T4 + R6 + G54) and hybridized them 
with the tests strains (R6 and G54). We concluded that 
the use of a single strain control increases the error rate 
in genes that are part of the accessory genome, where 
more variation across unsequenced strains is expected, 
justifying the use of the mix control. This conclusion 
was derived from the comparison of the discriminatory 
power of the LR values, using the known presence/ab- 
sence classification for each gene. No statistical method 
was used to estimate the best threshold LR value to div- 
ide present from absent genes. This step is necessary in 
the real application scenario, were the genome sequence 
of the test strain is not available. The present work com- 
plements our previous study by applying an expecta 
tion-maximization (EM) algorithm to estimate the best 
\threshold for the same experimental results. We will 
re-validate our previous conclusions about the choice of 
control and optimize several processing steps of mi- 
croarray data analysis. In particular, we will evaluate the 
impact of the logarithm transformation and the norma- 
lization steps, which are common steps in analysis of 
expression microarrays and in CGH analysis. We also wish 
to test if using an inverse ratio (C/T instead of T/C or log2 
(T/C)) provides better results with the EM algorithm. In 
the traditional ratio the present signals distribute around 1 
and the absent signals are compressed between 0 and 1. 
The logarithm transformation alleviates this compression, 
but we hypothesize that the two classes of signals can be 
further separated if the C/T ratio is used. 

Experimental data 

Part of the data analyzed in this study was obtained with 
a multi-strain Streptococcus pneumoniae CGH microar- 
rays already analyzed in a previous study [1]. The array 
was designed at PFGRC/JCVI {Streptococcus pneumoniae 
Version 5), with 3425 oligonucleotide probes covering the 



genomes of three pneumococcal strains: R6, T4 and G54. 
Each probe was 70 base long and replicated four times 
in each array. The annotation file provided by the array 
manufacturer was used to define which gene, from any 
of the three reference genomes, was represented by 
which probe. For each different pair of test strain and 
control, four replicates were done, resulting in a total of 
16 hybridizations. The four different hybridizations were 
R6 versus T4, R6 versus MIX (T4 + R6 + G54), G54 versus 
T4 and G54 versus MIX (T4 + R6 + G54). Dye swaps 
were applied in each set of four replicates. Microarray 
hybridization raw data was deposited in the ArrayExpress 
public database with accession number E-MEXP-1390. 

The images of the microarrays were analyzed using 
Feature Extraction 9.1 software (Agilent Technologies, 
Palo Alto, CA). For each spot the signal was background 
corrected by subtracting the minimum feature signal 
in the array. The intensity-specific bias was removed 
through loess global normalization. The resulting spot 
average pixel intensities were used to compute several 
signal ratio measures: the test/control signal ratio (T/C) 
and the corresponding logarithm signal ratio (log (T/C)), 
and the control/test signal ratio (C/T), before and after 
normalization. Since each gene was spotted four times per 
array, data retrieved from each of the valid spots were 
averaged for a particular gene. 

Additional data analyzed in this study was obtained 
from a published dataset [2] using a Staphylococcus aureus 
array (PFGRC/JCVI Staphylococcus aureus Version 5). Raw 
data files were imported from ArrayExpress (E-MEXP- 
2007). The hybridizations with a test strain represented 
in the array design were selected. This selection allowed 
the definition of which gene was present or absent in 
the test sample by using the information available in the 
array annotation files provided by the manufacturer. All 
the hybridizations used a single strain control. The ana- 
lysis was conducted using only the probes representing 
a gene present in the control sample. Raw data files 
were generated by GenePix image analysis software. For 
each probe, the F633 median and F532 median were 
used to define the T and C signals. 

Instead of hybridizations where the test strain is se- 
quenced, we could have tested the different analysis 
methods with microarray results of genes that were con- 
firmed by PCR assays. Although these assays are com- 
mon in published studies, they are not well suited for 
our purpose. In each dataset the number of genes con- 
firmed with PCR is normally low. We would have to 
analyze a high number of experiments to achieve a reason- 
able statistical confidence on the results. Additionally, 
PCR results may disagree with microarray result, even if 
both are physically correct. If the array is based on short 
oligos, that sequence can be conserved while the regions 
targeted by the PCR primers can be divergent (or the 
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other way around). Moreover, the selection of genes for 
PGR verification is not random. Genes can be selected 
because they are biologically more interesting, or be- 
cause they are more likely to confirm the microarray re- 
sults. This bias would make it more difficult to see any 
accuracy differences between G/T and T/G ratios using 
genes verified by PGR. 

Multi strain spot signal correction 

The Streptococcus pneumoniae microarrays were designed 
with three reference strains. The control sample can be a 
mixture of DNA of the same three strains. Thus, the mi- 
croarrays where the samples are hybridized contain se- 
quences that identify genes in one strain (group 1), two 
strains (group 2) or three strains (group 3). According to 
our previous study [1] when the mix control is used and 
the gene is present in the test strain, it is expected that the 
ratio of intensities T/G should be 3 if the gene is present 
in one strain (group 1), 3/2 if the gene is present in two 
strains (group 2) and 1 if the gene is present in three 
strains (group 3). When the gene is absent of the test sam- 
ple, it is expected that T/G reach values close to 0. So we 
chose to analyze corrected T/G values by multiplying the 
factors, 1, 2/3 or 3, according to each gene group (1, 2 or 
3, respectively). To obtain G/T corrected values the inverse 
factors are applied. When the single strain (T4) control is 
used, there are also two groups of spots: some contain 
sequences that identify T4 genes (group 1) while others 
contain sequences that do not identify T4 genes (group 2). 
In this case we only need to correct the signal for the 
group 2. Here, to calculate the signal ratio, the value of G 
is exchanged by a surrogate value that is the average of the 
G values for group 1 spots. All these corrections allow the 
application of the same classification algorithm to all spots 
in the array. 

Present/absent classification 

For each different signal ratio measure we applied the 
Estimation-Maximization (EM) algorithm to fit mixture 
models and to classify genes as present or absent. A 
mixture model is a convex combination of probability 
distributions that allows modeling data sets composed 
by different subsets, each of which is modeled by its 
own distribution. The mixture models used in this study 
were: the Normal-Uniform (NU) model and Gamma- 
Gamma (G) model. The first considers that signal ratios 
from present genes are shaped by the Normal distribution, 
while the missing genes are modeled by the Uniform 
distribution [13]. The Gamma-Gamma mixture model 
assumes that the signal ratio follows a Gamma distribu- 
tion both for present and for absent genes, although 
with different parameters [14]. 

The EM algorithm is an efficient tool in the estimation 
of mixture model parameters and classification of each 



gene in one of two distinct groups. Iteratively, it finds the 
parameters of both distributions in the mixture model that 
maximize the likelihood of obtaining the given observa- 
tions. At each step it computes for each gene the probabil- 
ity that it belongs to the present gene distribution. After 
the algorithm converges to a stable parameter set, each 
gene is considered present if the probability that it belongs 
to the present gene distribution is greater than 0.5. 

In this work the EM algorithm successfully converged 
in all occasions. The G model was not fitted to log(T/G) 
ratios because they present negative values, for which 
the Gamma distribution is not defined. Log(G/T) ratios 
were not evaluated because they have the symmetrical 
value of the log(T/G) ratios. As such, similar but sym- 
metric fits would be expected for both ratios, yielding 
the same classification accuracies. 

Evaluation of classifications 

The produced classifications were compared with the 
known classification. For each classification we identi- 
fied each gene as a true positive (TP) or true negative 
(TN) if the EM classification agreed with the known 
genome, and as a false positive (FP) or false negative 
(FN) if the EM classification did not agree. A gene is 
considered positive (either TP or FP) when the EM clas- 
sification predicted that gene to be present. With the 
number of TP, TN, FP and FN we can calculate the ac- 
curacy (Acc), that is, the proportion of correct results. 

TP + TN 
TP + TN + FP + FN 

To compare the results obtained with the mix and the 
single strain control, Wilcoxon rank sum tests for inde- 
pendent samples were used to detect significant differ- 
ences in the respective average accuracies, obtained from 
8 independent arrays for each control choice. To compare 
the results obtained with different signal ratios of mixture 
models, Wilcoxon signed rank tests for paired samples 
were used, since the variant methods were applied to the 
same replicate arrays. Differences were considered signifi- 
cant when p <0.05. 

Results 

Figure 1 shows the accuracy values for the pneumococ- 
cal arrays. The combinations of methods that have been 
tested were ordered according to the mean accuracies 
achieved. The highest accuracies were obtained with 
the mixed strain control, using the G/T ratio without 
normalization or logarithm transformation and fitting 
a normal-uniform mixture model. 

To compare the performance of both control types, 
the best analysis options for each kind of control were 
used. The mixed strain control (with G/T ratio, NU model 
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Figure 1 Accuracy for the methods of analysis applied. Each dot represents one Streptococcus pneumoniae array. The left plot represents the 
8 arrays using the TIGR4 control, and the rights plot the 8 arrays using the mix control. The bottom boxes along the x-axis identify the different 
combinations of methods tested: the mixture model is either based on the Gamma distribution (Gamma) or in the Normal and Uniform distributions 
(Norm./Unif.); ratios can be normalized with the loess procedure (Normalization); and the signal ratio can be Test over Control signal (T/C), logarithm of 
the Test over Control (Log(J/C) or Control over Test (C/T). The order of the different method combinations is defined by the resulting mean accuracy. 
Each method combination was applied to all the 16 arrays. 



and without normalization) reaches accuracies that are 
on average 0.05 superior (p < 0.001) to the single strain 
control (with normalized C/T ratio and NU model). This 
accuracy difference may translate in more 50 to 350 cor- 
rectly classified genes, considering bacterial genomes with 
1000 to 7000 genes. 

The impact of normalization and the mixture model 
choice are not clear from the analysis of Figure 1. Re- 
sults obtained with respect to the mixture model used 
are contradictory when the ratio is C/T or T/C. NU 
mixture model performs better with C/T ratios, while G 
mixture model produces higher accuracies when applied 
to T/C ratios. 

The choice of the ratio has a clear impact in the 
resulting accuracies. Independently of the control type, 
C/T ratios perform better than T/C ratios, and log(T/C) 
ratios present intermediate accuracies (Figure 1). To 
confirm this pattern, we re-analyzed a set of 6 CGH 
hybridizations where genomic DNA from sequenced 
Staphylococcus aureus strains was hibridyzed with a 
5. aureus microarray using a single strain control. Al- 
though these arrays are also based on 70-mer probes, 
they were designed for a different species and analyzed 
with different image analysis software. Figure 2 shows 
for each of the 16 pneumococcal, plus the 6 5'. aureus 



microarrays the accuracies obtained by using the T/C, Log 
(T/C) or C/T ratio. The pattern observed in Figure 2 is ex- 
tremely consistent across arrays. The C/T ratios lead to 
0.05 average increase in accuracy (p < 0.001) when com- 
pared with the Log(T/C) ratio applied to the same array. 

To understand the impact of the ratio choice in the 
mixture model fit, we analyzed the distribution of ratio 
values in one of the pneumococcal arrays (Figure 3). It is 
possible to observe that if the gene is present in the test 
strain (T 0), all the three ratios can be represented in 
terms of location by Normal distributions and, in the 
case of T/C and C/T ratios, by Gamma distributions. 
However, the fitted distributions show more dispersion 
than the histograms, not achieving sufficiently high 
densities in the distribution peaks. The Normal distribu- 
tion does not fit to any ratio distributions when the gene 
is not present in the test strain (T = 0). T/C and C/T ra- 
tios are better described by Gamma distributions when 
T = 0. This observation would predict better accuracies 
for the G mixture model with the T/C and C/T ratios. 
This prediction is correct for the T/C ratio, but the NU 
model leads to higher accuracies for the C/T and Log 
(C/T) ratios (Figure 1). The distributions of these two ra- 
tios when the T = 0 do not appear to be uniform. However, 
when compared with the corresponding distributions 
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S. pneumoniae arrays with mix control 




S. pneumoniae arrays witli TIGR4 control 




S. aureus arrays with single strain control 

Figure 2 Array-wise accuracy values with different signal ratios. The blue bars correspond to Test over Control ratio (T/C), green bars to the 
logarithm of the Test over Control ratio (Log(J/C) and the brown bars to the Control over Test ratio (C^. Each cluster of three bars was obtained 
from the same array. The top row shows results for Streptococcus pneumonioe arrays with the mix control, the center row for 5. pneumonioe 
arrays with the TIGR4 control and the bottom row for Staphylococcus aureus arrays using a single strain control. In all arrays a Normal-Uniform 
mixture model was adjusted to non-normalized signal ratios. 




Figure 3 Histograms of signal ratio values. A and D represent histograms of the Test over Control ratio (T/C) values. B and E represent 
histograms of the logarithm of the Test over Control ratio (Log(T/C) values. C and F represent histograms of the Control over Test ratio (C^ values. A, B 
and C histograms were drawn with ratio values of probes where the gene was absent in the test sample. D, E and F histograms were drawn with ratio 
values of probes where the gene was present in the test sample. Red lines are best fits of a Normal distribution to the drawn histogram. Blue lines are best 
fits of the Gamma distribution to the corresponding histograms. Due to the negative values of some Log (T/C) ratios, it was not possible to fit a Gamma 
distribution to the histograms in B and E. All the histograms where based on one Streptococcus pneumoniae microarray using the mix control. 

V . J 
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when T ;^ 0, they have much lower densities across a much 
larger range of values. This might explain the success 
of the NU model with C/T and Log (T/C) ratios. When 
T = 0, the distribution of T/C presents high densities, 
comparable with the densities of the T/C distribution 
when T 7^ 0. Additionally, for T = 0, the T/C values range 
from 0 to 1, presenting a large overlap with the T/C dis- 
tribution when T 7^ 0. It is possibly this overlap that is 
responsible for the lower accuracies obtained when the 
T/C ratio is used. 

Aiming to quantify the advantage in averaging replicate 
arrays, we combined the four available replicates of each 
hybridization into the possible sets of 2, 3 or 4 replicates 
and evaluated the accuracy of the resulting classifications, 
shown in Figure 4. In this evaluation we used the C/T ra- 
tios with the NU mixture model, as this option produced 
the best results without replication. Although there is a 
small increment in accuracy due to replication, no signifi- 
cant difference is found, justifying the use of just one or 
two replicates per sample in this particular application. 

Discussion 

Other authors [2,3] have previously recognized that some 
methods applied to analyze aCGH experiments were 
inherited from microarray gene expression analysis, with- 
out specifically evaluating their adequacy for aCGH data. 
Our results highlight this problem, showing that loess 
normalization generally has no significant impact in gene 
classification as present or absent. 

Logarithm transformation also has a negative impact 
when compared with the C/T ratio. In the analysis of 
gene expression, the aim is to identify genes that are 
either over or under expressed. The logarithm trans- 
formation is useful to make both types of changes quan- 
titatively comparable. In aCGH experiments applied to 
bacterial genomes, genes are either present or absent 



TIGR4 control 



Number of replicates 

Figure 4 Accuracy obtained with replicate averaging. The blue 
(red) dots correspond to accuracies of different arrays or combinations 
of replicate arrays using the mix control (TIGR4 control). The blue (red) 
line represents the evolution of the mean accuracies with increasing 
replicate numbers for the mix (TIGR4) control. 



(which includes genes with divergent sequences). The 
use of the C/T ratio expands the signal scale in the most 
interesting range for this type of analysis, facilitating the 
statistical segregation of signal distributions between 
present and absent genes. We should note that using 
the C/T ratio instead of log(T/C) (or of T/C ratio) only 
reverses the order of observed gene signals, without 
ranking changes. This implies that the optimal threshold 
separating present and absent genes with all three ratio 
variants will produce the same maximum accuracy. What 
changes is the capacity of the EM algorithm to adjust the 
mixture model to the available data, leading to better re- 
sults with the C/T ratio. Previous studies have showed that 
fitting mixture models to aCGH data provided better re- 
sult than several other methods [7,15]. Their comparison 
was based in the use of log(T/C) ratios. Our study reveals 
that the success of mixture model fitting in aCGH analysis 
can be improved with the use of the C/T ratio. 

The present results also confirm our previous study 
[1], in which the advantages of the use of a mix control 
were first observed. We believe that the presence of a 
present signal in the control channel of all the spots in 
the array is important for the resulting classification 
obtained through the fitting of a mixture model to the 
observed C/T signal ratios. The choice of a mix or a sin- 
gle strain control is posed when working with arrays rec- 
ognizing genes from multiple genomes. As the number 
of available genome in the array increases, the mix con- 
trol may loose its superiority, as genes that are only 
present in one of the reference genomes will have an 
increasingly weak control signal, approaching 0. Other 
authors [2] have presented alternative methods for the 
analysis of bacterial aCGH experiments that use the 
signal distribution of the control channel independ- 
ently of the test channel. These methods apparently 
reduce the need for a mix control. Nevertheless, our 
work shows that the mix control performs significantly 
better with an array designed for three reference ge- 
nomes and analyzed with a mixture model approach. 

Nowadays, comparative genomic studies in bacteria 
are migrating from the microarray technology to high 
throughput sequencing methods. This happens due to 
the decreases in the costs associated with sequencing, but 
also due to a gain in information: from present-absence 
status to a detection of diverging sequences. Still, some 
ongoing projects keep using microarrays, justifying the 
usefulness of the methodological comparison we are pre- 
senting [16-22]. Furthermore, our conclusions about the 
impact of the C/T ratio can justify the re-analysis of many 
already published and publicly available datasets, as we 
have demonstrated by analyzing the S. aureus arrays. 

Our results must be interpreted taking into account 
some limitations of our study. We tested arrays from 
two species using a similar array design strategy and 
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manufacturer. Additionally, only one normalization algo- 
rithm and two different mixture models were compared. 
Although we are confident that the observations are 
sufficiently general to apply in other contexts, we cannot 
anticipate the effect of different genomes, microarray plat- 
forms, normalization procedures or mixture models using 
different distributions. We could, for example, expand our 
study to include a higher number of different organisms, 
in order to generalize our conclusions. The advantages of 
the C/T ratio are not necessarily valid for every organism. 
Nevertheless, as our study shows that for two species the 
choice of ratio has a significant impact in classification 
accuracy, we can conclude that, when working with a 
different organism, it is worthy to conduct a similar 
study to optimize the signal ratio used for analysis. 

Two aCGH analysis problems were not approached in 
our work. One is the detection of multiple gene copies. 
If there are more copies of a given gene in the genome 
of control strains than in the test strains, the gene can 
be erroneously classified as absent. From the array de- 
sign it should be possible to identify probes targeting 
genes with multiple copies in the control strains, and 
perform a separate analysis in case they where classified as 
absent. Probes with a single copy in the control strains 
and multiple copies in the test strain will most likely be 
classified as present. A subsequent analysis of probes 
classified as present can be developed to quantify the 
number of copies, but it is expectable that as the copy 
number grows, there should be saturation in the fluor- 
escence signal. 

The second analysis problem is the quantification of 
sequence divergence. Our analysis returns a probability 
of gene presence in the test strain. When that probability 
has intermediate values (around 0.5), one can suspect of a 
divergent gene. However, the same values can be equally 
originated by excessive noise. Even if there are very low 
noise levels, a sequence divergence determined by micro- 
array just implies a sequence difference along the region 
targeted by the microarray probe. The remaining sequence 
of the gene can be highly conserved or highly divergent. 
The method proposed by Snipen and colleagues [2] uses 
the signals from the control channel to adjust a quanti- 
tative function relating sequence divergence and signal 
intensity. Although this method gives a quantitative an- 
swer to this problem, it is still affected by the noise 
levels and by the diversity of divergent sequences in the 
control sample. Sequencing studies provide a better 
approach to detect and quantify divergent regions. 

Conclusions 

We performed a comparison of several alternative methods 
to analyze bacterial aCGH datasets. All alternatives had in 
common the use of the EM algorithm to adjust a mixture 
model to the observed signal distributions. The results 



validated a previous study advocating the use of a strain 
mix as a control in aCGH experiments with multi- genome 
arrays, instead of a single strain control. Additionally, we 
found that loess normalization or the choice of mixture 
model distributions did not have a clear impact on the re- 
sults. The method variant that induced a greater improve- 
ment in achieved classification accuracies was the use of a 
control-over-test signal ratio, which is the inverse of the 
ratio traditionally used to analyze this type of results. 
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